Heart Disease Dataset

In this article, we demonstrate solving a classification problem in TensorFlow with Estimators, using the Heart Disease dataset from the UCI Machine Learning Repository.


Attribute Information:

  1. Age
  2. Sex
    • 0: Female
    • 1: Male
  3. Chest Pain Type
    • 1: Typical Angina
    • 2: Atypical Angina
    • 3: Non-Anginal Pain
    • 4: Asymptomatic
  4. Serum Cholesterol (in mg/dl)
  5. FBS: Fasting Blood Sugar > 120 mg/dl
    • 0 = False
    • 1 = True
  6. Resting Electrocardiographic Results
    • 0: normal
    • 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  7. Maximum Heart Rate Achieved
  8. Exercise Induced Angina
    • 0: No
    • 1: Yes
  9. Oldpeak = ST Depression Induced By Exercise Relative To Rest
  10. Slope: The Slope Of The Peak Exercise ST Segment
    • 1: Upsloping
    • 2: Flat
    • 3: Downsloping
  11. Number Of Major Vessels (0-3) Colored By Fluoroscopy
  12. Thal
    • 3: Normal
    • 6: Fixed Defect
    • 7: Reversible Defect

Variable to be predicted

The target variable indicates the presence or absence of heart disease in the patient.

Problem Description

The objective of the exercise is to build a model that predicts whether heart disease is present or absent based on the remaining features.

Modeling: TensorFlow Boosted Trees Classifier with Feature Importance Analysis

Train and Test sets
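The notebook's code is not reproduced here; as a minimal sketch of the split, using a small synthetic stand-in for the loaded DataFrame (the column names, including the label column `target`, are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the loaded Heart Disease DataFrame;
# column names ('age', ..., 'target') are assumed for illustration.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    'age': rng.integers(29, 78, n),
    'sex': rng.integers(0, 2, n),
    'cp': rng.integers(1, 5, n),
    'chol': rng.integers(126, 565, n),
    'thalach': rng.integers(71, 203, n),
    'target': rng.integers(0, 2, n),
})

# Hold out 20% of the rows for evaluation and split off the labels.
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)
y_train = train_df.pop('target')
y_test = test_df.pop('target')
```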

Feature Columns

Create the feature columns, using the original numeric columns as-is and one-hot encoding the categorical variables.
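As a sketch, with column names and category vocabularies taken from the attribute list above (the exact names used in the notebook may differ):

```python
import tensorflow as tf

NUMERIC = ['age', 'chol', 'thalach', 'oldpeak', 'ca']
CATEGORICAL = {
    'sex': [0, 1],
    'cp': [1, 2, 3, 4],
    'fbs': [0, 1],
    'restecg': [0, 1, 2],
    'exang': [0, 1],
    'slope': [1, 2, 3],
    'thal': [3, 6, 7],
}

feature_columns = []
# Numeric columns are used as-is.
for name in NUMERIC:
    feature_columns.append(tf.feature_column.numeric_column(name, dtype=tf.float32))
# Categorical columns are one-hot encoded via an indicator column.
for name, vocab in CATEGORICAL.items():
    cat = tf.feature_column.categorical_column_with_vocabulary_list(name, vocab)
    feature_columns.append(tf.feature_column.indicator_column(cat))
```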

Input Function

The input function specifies how data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. Concretely, an input function returns a tf.data.Dataset object that yields two-element (features, labels) tuples, where features is a dictionary mapping feature names to tensors and labels is the tensor of targets.

Building the input pipeline
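A minimal sketch of such a pipeline (the helper name `make_input_fn` and the toy frame are illustrative, not the notebook's exact code):

```python
import pandas as pd
import tensorflow as tf

def make_input_fn(X, y, n_epochs=None, shuffle=True):
    """Build an input_fn that yields (features_dict, labels) batches."""
    def input_fn():
        ds = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            ds = ds.shuffle(len(y))
        # Repeat for n_epochs (None = indefinitely) and emit one full batch.
        return ds.repeat(n_epochs).batch(len(y))
    return input_fn

# Tiny illustrative frame; real code would pass the train/test splits.
X = pd.DataFrame({'age': [63.0, 45.0, 52.0], 'thal': [3, 6, 7]})
y = pd.Series([1, 0, 1])
features, labels = next(iter(make_input_fn(X, y)()))
```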

Training the model

An alternative way to speed up training is the train_in_memory option, which loads the whole dataset at once instead of streaming batches. However, if training time is not a concern, training without this option is recommended [5]. Moreover, in our runs train_in_memory did not always improve training performance.

Feature Importance

We can investigate the feature importances learned for this classification task, much as one would with scikit-learn's feature-importance tools; the approach is outlined in [6].

A nice property of directional feature contributions (DFCs) is that the sum of the contributions plus the bias equals the prediction for a given example.
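This property can be checked directly. Below, `pred_dicts` is a hand-made stand-in with the same record structure as the output of the estimator's `experimental_predict_with_explanations` method (the numbers are toy values, not real model output):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the records returned by
# est.experimental_predict_with_explanations(eval_input_fn).
pred_dicts = [
    {'dfc': {'age': 0.12, 'thal': -0.08}, 'bias': 0.55, 'probabilities': [0.41, 0.59]},
    {'dfc': {'age': -0.05, 'thal': -0.10}, 'bias': 0.55, 'probabilities': [0.60, 0.40]},
]

df_dfc = pd.DataFrame([p['dfc'] for p in pred_dicts])
bias = pred_dicts[0]['bias']
probs = np.array([p['probabilities'][1] for p in pred_dicts])

# Sum of per-feature contributions plus the bias recovers each prediction.
assert np.allclose(df_dfc.sum(axis=1) + bias, probs)
```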

Feature Importance for a patient

Plot the DFCs for an individual patient, color-coded by the direction of each contribution, and annotate the feature values on the figure.
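A sketch of such a plot, using a hand-made DFC row and feature values for one patient (stand-ins for a row of the real DFC table):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Toy DFC row for one patient (stand-in for df_dfc.iloc[idx]).
example = pd.Series({'age': 0.12, 'thal': -0.08, 'ca': 0.05, 'oldpeak': -0.02})
feature_values = pd.Series({'age': 63, 'thal': 3, 'ca': 0, 'oldpeak': 2.3})

# Sort by magnitude so the largest contribution is drawn at the top.
sorted_ix = example.abs().sort_values().index
colors = ['tab:green' if v >= 0 else 'tab:red' for v in example[sorted_ix]]
labels = [f'{name} = {feature_values[name]}' for name in sorted_ix]

fig, ax = plt.subplots()
ax.barh(labels, example[sorted_ix], color=colors)
ax.set_xlabel('Directional feature contribution')
fig.savefig('dfc_patient.png')
```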

Global feature importances

Gain-based feature importances

Gain-based feature importances are built into the TensorFlow Boosted Trees estimators and exposed through classifier.experimental_feature_importances.

Average absolute DFCs
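Averaging the absolute DFCs over all examples gives another global importance measure. A sketch on a toy stand-in for the per-example DFC table:

```python
import pandas as pd

# Toy stand-in for the per-example DFC table (rows = patients, cols = features).
df_dfc = pd.DataFrame({
    'age':  [0.12, -0.05, 0.08],
    'thal': [-0.08, -0.10, 0.02],
    'ca':   [0.05, 0.00, -0.03],
})

# Global importance: mean absolute contribution across all examples.
importances = df_dfc.abs().mean().sort_values(ascending=False)
```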

Permutation feature importance
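Permutation importance measures how much a metric degrades when a single feature's values are shuffled, breaking its relationship with the label. A model-agnostic sketch (the helper and the toy "model" below are illustrative, not the notebook's code):

```python
import numpy as np
import pandas as pd

def permutation_importance(predict_fn, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in the metric when each column is shuffled independently."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict_fn(X))
    scores = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X_perm[col].values)
            drops.append(baseline - metric(y, predict_fn(X_perm)))
        scores[col] = float(np.mean(drops))
    return pd.Series(scores).sort_values(ascending=False)

# Demo with a hand-made "model" that only looks at 'age'.
rng = np.random.default_rng(1)
X = pd.DataFrame({'age': rng.integers(30, 70, 200), 'noise': rng.normal(size=200)})
y = (X['age'] > 50).astype(int)

def accuracy(y_true, y_pred):
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())

def predict(X):
    return (X['age'] > 50).astype(int)

imp = permutation_importance(predict, X, y, accuracy)
```

Shuffling `age` destroys the prediction, so its score is large; shuffling `noise` changes nothing, so its score is zero.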


References

  1. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.J., Sandhu, S., Guppy, K.H., Lee, S. and Froelicher, V., 1989. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American journal of cardiology, 64(5), pp.304-310.

  2. Aha, D. and Kibler, D., 1988. Instance-based prediction of heart-disease presence with the Cleveland database. University of California, 3(1), pp.3-2.

  3. Gennari, J.H., Langley, P. and Fisher, D., 1989. Models of incremental concept formation. Artificial intelligence, 40(1-3), pp.11-61.

  4. Regression analysis. Wikipedia. Last edited on 17 April 2020, at 13:31 (UTC). https://en.wikipedia.org/wiki/Regression_analysis
  5. TensorFlow tutorials, https://www.tensorflow.org/tutorials
  6. TensorFlow Boosted Trees Classifier, https://www.tensorflow.org/api_docs/python/tf/estimator/BoostedTreesClassifier?version=nightly
  7. Lasso (statistics), https://en.wikipedia.org/wiki/Lasso_(statistics)
  8. Tikhonov regularization, https://en.wikipedia.org/wiki/Tikhonov_regularization
  9. Palczewska A., Palczewski J., Marchese Robinson R., Neagu D. (2014) Interpreting Random Forest Classification Models Using a Feature Contribution Method. In: Bouabana-Tebibel T., Rubin S. (eds) Integration of Reusable Systems. Advances in Intelligent Systems and Computing, vol 263. Springer, Cham
  10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, pp. 3-7). New York: springer.
  11. Jordi Warmenhoven, ISLR-python
  12. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R